Finding crime patterns in Montgomery County

Group components

  • Cephas Barreto- cephasax [at] gmail [dot] com>
  • Marco Olimpio - marco.olimpio [at] gmail [dot] com
  • Rebecca Betwel - bekbetwel [at] gmail [dot] com

About the dataset

The data presented is derived from reported crimes classified according to Maryland criminal code and documented by approved police incident reports. The data about crimes do not put info about the victins and its masks the actual address not putting the exact place where the complaint occured.


Source: https://data.world/jboutros/montgomery-county-crime

Maryland County Area


Checking about data available

  • Incident ID: Looks like a simple table identification number
  • CR Number: CR stands for Complaint Register and its a identification for a Compleint Process for a full disciplinary investigation
  • Dispatch Date/Time: Looks the date and time when the complaint was made
  • Class
    • Class number:Identification number of the complaint
    • Class description:Description of the class
  • Complaint
    • Public place
      • Police District Name: Auto describes it
      • Police District Number: Auto describes it
      • Block address: Auto describes it
      • City: Auto describes it
      • State: Auto describes it
      • Zip Code: Auto descrives it
      • Agency: Agency responsable for this address
      • Place: Kind of place where the crime occured
      • Sector
        • Rockville District: Sectors A, B, C
        • Bethesda District: Sectors D, E
        • Silver Spring District: Sector G
        • Wheaton-Glenmont District: Sector J, K
        • Germantown District: Sectors M, N, P
      • Address Number: Auto describes it
      • Beat: Beat is the territory and time that a police officer patrols
      • PRA: Police Reporting Area
      • Latitude: Auto describes it
      • Longitude: Auto describes it
      • Location: Tuple of Latitude and Longitude
    • Complaint estimative
      • Start Data/Time: Start of the complaint
      • End Date/Time: End of the complaint

Dataset questions

  • About Type of complaint
    • Which complaint is most common?
    • What are the categories of complaints?
    • Could we categorize the types of crimes in violent or not?
  • About Period of time/day of the week
    • Wich period of the day that most complaints occur
    • Wich day of the week that most complaints occur
    • Wich month of the years that most complaints occur
    • These complainsts are realted with holidays?
    • What period of time (time of day/day of the week/month of the year) has correlation with the type of complaint
  • About Location
    • Where is most of the complaints?
    • What sort of places have most complaints
    • What sort of place has correlation with the type of complaint
  • Correlation between locale and type of complaint
    • Is there a correlations between the day of the week and kind of complaint?

References

  • https://www.montgomerycountymd.gov/pol/districts/whatsmydistrict.html
  • http://www.ericcarlson.net/scanner/police.html

The begining

Fist of all we need to import libraries needed. NumPy, Pandas, Matplotlib and Bokeh are needed to run the scripts of this notebook. Below we can see how to import these libraries and how to configure Bokeh to show chart inline (calling output_notebook() function)


In [65]:
import pandas as pd
import numpy  as np
import matplotlib.pyplot as plt
import seaborn

#Importing Bokeh library, updated to version 0.12.9
from bokeh.io import push_notebook, show, output_notebook, output_file
from bokeh.layouts import row
from bokeh.plotting import figure
from bokeh.sampledata.commits import data

from bokeh.models import (
    GMapPlot, GMapOptions, ColumnDataSource, Circle, DataRange1d, PanTool, WheelZoomTool, BoxSelectTool, Jitter
)
from bokeh.core.properties import field
#from bokeh.transform import jitter
#from bokeh.models.transforms import Jitter

#Inline bokeh charts
output_notebook()


Loading BokehJS ...

Loading datasets


In [89]:
#Loading dataset from Montgomery County complaint dataset
monty_data = pd.read_csv("MontgomeryCountyCrime2013.csv")
monty_data.head()


Out[89]:
Incident ID CR Number Dispatch Date / Time Class Class Description Police District Name Block Address City State Zip Code ... Sector Beat PRA Start Date / Time End Date / Time Latitude Longitude Police District Number Location Address Number
0 200939101 13047006 10/02/2013 07:52:41 PM 511 BURG FORCE-RES/NIGHT OTHER 25700 MT RADNOR DR DAMASCUS MD 20872.0 ... NaN NaN NaN 10/02/2013 07:52:00 PM NaN NaN NaN OTHER NaN 25700.0
1 200952042 13062965 12/31/2013 09:46:58 PM 1834 CDS-POSS MARIJUANA/HASHISH GERMANTOWN GUNNERS BRANCH RD GERMANTOWN MD 20874.0 ... M 5M1 470.0 12/31/2013 09:46:00 PM NaN NaN NaN 5D NaN NaN
2 200926636 13031483 07/06/2013 09:06:24 AM 1412 VANDALISM-MOTOR VEHICLE MONTGOMERY VILLAGE OLDE TOWNE AVE GAITHERSBURG MD 20877.0 ... P 6P3 431.0 07/06/2013 09:06:00 AM NaN NaN NaN 6D NaN NaN
3 200929538 13035288 07/28/2013 09:13:15 PM 2752 FUGITIVE FROM JUSTICE(OUT OF STATE) BETHESDA BEACH DR CHEVY CHASE MD 20815.0 ... D 2D1 11.0 07/28/2013 09:13:00 PM NaN NaN NaN 2D NaN NaN
4 200930689 13036876 08/06/2013 05:16:17 PM 2812 DRIVING UNDER THE INFLUENCE BETHESDA BEACH DR SILVER SPRING MD 20815.0 ... D 2D3 178.0 08/06/2013 05:16:00 PM NaN NaN NaN 2D NaN NaN

5 rows × 22 columns

As we described we have all theses coluns in this dataset.


In [90]:
monty_data.columns


Out[90]:
Index(['Incident ID', 'CR Number', 'Dispatch Date / Time', 'Class',
       'Class Description', 'Police District Name', 'Block Address', 'City',
       'State', 'Zip Code', 'Agency', 'Place', 'Sector', 'Beat', 'PRA',
       'Start Date / Time', 'End Date / Time', 'Latitude', 'Longitude',
       'Police District Number', 'Location', 'Address Number'],
      dtype='object')
Analysing the dataset loaded we can check the we have complaints 23369

In [91]:
number_of_registries = monty_data.shape
print(number_of_registries[0])


23369

Which sort of complaints are most common and Pareto Analisys?


In [92]:
#Using the agg function allows you to calculate the frequency for each group using the standard library function len.
#Sorting the result by the aggregated column code_count values, in descending order, then head selecting the top n records, then reseting the frame; will produce the top n frequent records
top = monty_data.groupby(['Class','Class Description'])['Class'].agg({"frequency": len}).sort_values("frequency", ascending=False).head(43).reset_index()
top['frequency'] = (top['frequency']/number_of_registries[0])*100
top


Out[92]:
Class Class Description frequency
0 2812 DRIVING UNDER THE INFLUENCE 7.317386
1 1834 CDS-POSS MARIJUANA/HASHISH 5.708417
2 2938 POL INFORMATION 5.096495
3 614 LARCENY FROM AUTO OVER $200 3.911164
4 617 LARCENY FROM BUILDING OVER $200 3.829860
5 2942 MENTAL TRANSPORT 3.598785
6 1412 VANDALISM-MOTOR VEHICLE 3.260730
7 2941 LOST PROPERTY 3.119517
8 619 LARCENY OTHER OVER $200 2.790021
9 634 LARCENY FROM AUTO UNDER $50 2.610296
10 1013 FORGERY/CNTRFT - IDENTITY THEFT 2.550387
11 613 LARCENY SHOPLIFTING OVER $200 2.263683
12 623 LARCENY SHOPLIFTING $50 - $199 2.092516
13 2216 LIQUOR - DRINK IN PUB OVER 21 2.062562
14 2413 DISORDERLY CONDUCT 1.985536
15 811 ASSAULT & BATTERY - CITIZEN 1.634644
16 624 LARCENY FROM AUTO $50 - $199 1.523386
17 1011 FORGERY/CNTRFT-CRDT CARDS 1.463477
18 2943 MISSING PERSON 1.459198
19 821 SIMPLE ASSAULT - CITIZEN 1.330823
20 1864 CDS IMPLMNT-MARIJUANA/HASHISH 1.270914
21 711 AUTO THEFT - PASSENGER VEHICLE 1.253798
22 813 ASSAULT & BATTERY SPOUSE/PARTNER 1.240960
23 635 LARCENY AUTO PART UNDER $50 1.142539
24 2111 JUVENILE RUNAWAY 1.104027
25 633 LARCENY SHOPLIFTING UNDER $50 1.104027
26 512 BURG FORCE-RES/DAY 1.027002
27 2737 TRESPASSING 0.954256
28 2913 SUDDEN DEATH NATURAL 0.954256
29 627 LARCENY FROM BUILDING $50-$199 0.945697
30 1411 VANDALISM-DWELLING 0.907185
31 639 LARCENY OTHER UNDER $50 0.894347
32 629 LARCENY OTHER $50 - $199 0.864393
33 1014 FORGERY/CNTRFT-ALL OTHER 0.842997
34 513 BURG FORCE-RES/TIME UNK 0.787368
35 616 LARCENY BICYCLE OVER $200 0.761693
36 1417 VANDALISM-OTHER 0.736018
37 637 LARCENY FROM BLDG UNDER $50 0.706064
38 1824 CDS-SELL-MARIJUANA/HASHISH 0.671830
39 514 BURG FORCE-COMM/NIGHT 0.637597
40 2946 RECOVERED PROPERTY/MONT. CO. 0.629038
41 1012 FORGERY/CNTRFT-CHECKS 0.624759
42 511 BURG FORCE-RES/NIGHT 0.616201

As we can see the most common type of complaint is "Driving under influence". Unfortunatelly we do not have the type of subtance influencing the driver like alcohool and other drugs. Altought, interestingly the second most common complaint is "Possession of CDS (Controlled Dangerous Substance) marijuana/hashish". But we also can not correlate the substance of with the drives is influenced with the possession of marijuana/hashish.

The number of 43 Classes of crimes was not a magic number. These 43 classes of crimes make 80% of the total of crimes and the 20% are distributed in the other 242 classes.

The main purpose of this kind of approach is to be right to the point, where the overwhelming majority of the complaints are related. The Police Office or County's board of advisors could make decisions to prevent or mitigate these kinds of problems based on the ammount of crimes. This kind of approach will be time saving :] in the methods that cannot be fully automated, like classification of crimes in violent or not or the description of 'master classes'.


In [70]:
from decimal import *
#Configure precision
getcontext().prec = 2

parcial_perc = top['frequency'].sum()
parcial_perc = round(parcial_perc,2)
tot_classes = monty_data['Class'].value_counts(normalize=True, sort=True).shape[0]
print("The crimes above are responsible for up to " + str(parcial_perc) + "% (Pareto Analysis) of the total crimes. Performed by "+ str(top.shape[0])+" out of "+ str(tot_classes)+" classes of crimes!")
print("For precision purposes we had only " + str(round((43/285)*100,2)) + "% classes that represent 80% of the total of complaints")


The crimes above are responsible for up to 80.29% (Pareto Analysis) of the total crimes. Performed by 43 out of 285 classes of crimes!
For precision purposes we had only 15.09% classes that represent 80% of the total of complaints

What are the Classes of Classes (Master Classes) of complaints?

In terms of granularity we cannot extract much of information utilizing the classification system adopted. So we observed that there is a possibility to aglutinate a set of classes in a 'master class'. That could agregate a number of crimes in a single class as we can see below:


In [93]:
#Creating a master class to categorize crimes
classaux = monty_data["Class"]/100
classaux = classaux.astype(int)
classaux = classaux*100

#Inserting this new data in the dataset
monty_data["MasterClass"] = classaux
monty_data[["Class","Class Description",'MasterClass']].head(5)


Out[93]:
Class Class Description MasterClass
0 511 BURG FORCE-RES/NIGHT 500
1 1834 CDS-POSS MARIJUANA/HASHISH 1800
2 1412 VANDALISM-MOTOR VEHICLE 1400
3 2752 FUGITIVE FROM JUSTICE(OUT OF STATE) 2700
4 2812 DRIVING UNDER THE INFLUENCE 2800

We can observe that line 17 and 19 below are respectively from class 1834 and 1833. Looking at its 'Class Description' both are classified as Controlle Dangerous Substance Possession and could be agregated in a major class 1800.


In [95]:
monty_data[17:25].head(3)


Out[95]:
Incident ID CR Number Dispatch Date / Time Class Class Description Police District Name Block Address City State Zip Code ... Beat PRA Start Date / Time End Date / Time Latitude Longitude Police District Number Location Address Number MasterClass
17 200925988 13030561 07/01/2013 03:26:03 AM 1834 CDS-POSS MARIJUANA/HASHISH SILVER SPRING 12200 NEW HAMPSHIRE AVE SILVER SPRING MD 20904.0 ... 3I1 519.0 07/01/2013 03:26:00 AM NaN 39.055327 -76.995580 3D (39.055326908694369, -76.99557965357296) 12200.0 1800
18 200925991 13030553 07/01/2013 01:03:17 AM 2812 DRIVING UNDER THE INFLUENCE SILVER SPRING 8700 PLYMOUTH ST SILVER SPRING MD 20901.0 ... 3H1 126.0 07/01/2013 01:03:00 AM NaN 38.999406 -77.007736 3D (38.999406266117461, -77.007735832110484) 8700.0 2800
19 200925992 13030559 07/01/2013 01:49:33 AM 1833 CDS-POSS COCAINE& DERIVATIVES GERMANTOWN 12700 MIDDLEBROOK RD GERMANTOWN MD 20874.0 ... 5N1 595.0 07/01/2013 01:49:00 AM NaN 39.176504 -77.263908 5D (39.176504471375559, -77.263907914323482) 12700.0 1800

3 rows × 23 columns

Bellow we ranked by frequency the lista of crimes and it's 'master class'


In [117]:
#Considering the top crimes

#copy
top_classes_top = top

#Creation of a Master Class
top_classes_top['Master Class'] = 0
aux = top_classes_top['Master Class'].astype(float,copy=True)
top_classes_top['Master Class'] = aux
top_classes_top['Master Class'] = top_classes_top['Class']/100
top_classes_top['Master Class'] = top_classes_top['Master Class'].round()
top_classes_top['Master Class'] = top_classes_top['Master Class']*100
aux = top_classes_top['Master Class'].astype(int,copy=True)
top_classes_top['Master Class'] = aux

top_classes_top


Out[117]:
Class Class Description frequency Master Class
0 2812 DRIVING UNDER THE INFLUENCE 7.317386 2800
1 1834 CDS-POSS MARIJUANA/HASHISH 5.708417 1800
2 2938 POL INFORMATION 5.096495 2900
3 614 LARCENY FROM AUTO OVER $200 3.911164 600
4 617 LARCENY FROM BUILDING OVER $200 3.829860 600
5 2942 MENTAL TRANSPORT 3.598785 2900
6 1412 VANDALISM-MOTOR VEHICLE 3.260730 1400
7 2941 LOST PROPERTY 3.119517 2900
8 619 LARCENY OTHER OVER $200 2.790021 600
9 634 LARCENY FROM AUTO UNDER $50 2.610296 600
10 1013 FORGERY/CNTRFT - IDENTITY THEFT 2.550387 1000
11 613 LARCENY SHOPLIFTING OVER $200 2.263683 600
12 623 LARCENY SHOPLIFTING $50 - $199 2.092516 600
13 2216 LIQUOR - DRINK IN PUB OVER 21 2.062562 2200
14 2413 DISORDERLY CONDUCT 1.985536 2400
15 811 ASSAULT & BATTERY - CITIZEN 1.634644 800
16 624 LARCENY FROM AUTO $50 - $199 1.523386 600
17 1011 FORGERY/CNTRFT-CRDT CARDS 1.463477 1000
18 2943 MISSING PERSON 1.459198 2900
19 821 SIMPLE ASSAULT - CITIZEN 1.330823 800
20 1864 CDS IMPLMNT-MARIJUANA/HASHISH 1.270914 1900
21 711 AUTO THEFT - PASSENGER VEHICLE 1.253798 700
22 813 ASSAULT & BATTERY SPOUSE/PARTNER 1.240960 800
23 635 LARCENY AUTO PART UNDER $50 1.142539 600
24 2111 JUVENILE RUNAWAY 1.104027 2100
25 633 LARCENY SHOPLIFTING UNDER $50 1.104027 600
26 512 BURG FORCE-RES/DAY 1.027002 500
27 2737 TRESPASSING 0.954256 2700
28 2913 SUDDEN DEATH NATURAL 0.954256 2900
29 627 LARCENY FROM BUILDING $50-$199 0.945697 600
30 1411 VANDALISM-DWELLING 0.907185 1400
31 639 LARCENY OTHER UNDER $50 0.894347 600
32 629 LARCENY OTHER $50 - $199 0.864393 600
33 1014 FORGERY/CNTRFT-ALL OTHER 0.842997 1000
34 513 BURG FORCE-RES/TIME UNK 0.787368 500
35 616 LARCENY BICYCLE OVER $200 0.761693 600
36 1417 VANDALISM-OTHER 0.736018 1400
37 637 LARCENY FROM BLDG UNDER $50 0.706064 600
38 1824 CDS-SELL-MARIJUANA/HASHISH 0.671830 1800
39 514 BURG FORCE-COMM/NIGHT 0.637597 500
40 2946 RECOVERED PROPERTY/MONT. CO. 0.629038 2900
41 1012 FORGERY/CNTRFT-CHECKS 0.624759 1000
42 511 BURG FORCE-RES/NIGHT 0.616201 500

In [126]:
top_classes_top[['Class Description','frequency']].plot.bar()


Out[126]:
<matplotlib.axes._subplots.AxesSubplot at 0x1171cff60>

Configuring description of the 'Master Classes'

Analysing the descriptions of the crimes is common to see that the 'class description' are separated with a hyphen sign but not all master classes of crimes could be generalized to adopt the left portion of '-' description. As we can notice below the master class 2900 was classified as 'Misc.' because there is more than one type of crime related and we did not found a better derscription. Its important to notice that we only worked with the top complaints (Pareto). And because of the 20/80 analysis we only have to threat 14 'master classes'.


In [75]:
#Inserting the description of the Master Classes
top_classes_top['Master Class Description'] ='' 

top_classes_top[top_classes_top['Master Class'] == 600]
test_top = top_classes_top


test_top.loc[(test_top['Master Class'] ==  600),'Master Class Description'] = 'LARCENY'
test_top.loc[(test_top['Master Class'] == 2900),'Master Class Description'] = 'MISC'
test_top.loc[(test_top['Master Class'] == 1400),'Master Class Description'] = 'VANDALISM'
test_top.loc[(test_top['Master Class'] == 1000),'Master Class Description'] = 'FORGERY/CNTRFT'
test_top.loc[(test_top['Master Class'] ==  500),'Master Class Description'] = 'BURGLARY'
test_top.loc[(test_top['Master Class'] ==  800),'Master Class Description'] = 'ASSAULT & BATTERY'
test_top.loc[(test_top['Master Class'] == 1800),'Master Class Description'] = 'CONTROLLED DANGEROUS SUBSTANCE POSSESSION'
test_top.loc[(test_top['Master Class'] ==  700),'Master Class Description'] = 'THEFT'
test_top.loc[(test_top['Master Class'] == 2100),'Master Class Description'] = 'JUVENILE RUNAWAY'
test_top.loc[(test_top['Master Class'] == 2800),'Master Class Description'] = 'DRIVING UNDER THE INFLUENCE'
test_top.loc[(test_top['Master Class'] == 1900),'Master Class Description'] = 'CONTROLLED DANGEROUS SUBSTANCE IMPLMNT'
test_top.loc[(test_top['Master Class'] == 2200),'Master Class Description'] = 'LIQUOR - DRINK IN PUB OVER 21'
test_top.loc[(test_top['Master Class'] == 2400),'Master Class Description'] = 'DISORDERLY CONDUCT'
test_top.loc[(test_top['Master Class'] == 2700),'Master Class Description'] = 'TRESPASSING'

test_top


Out[75]:
Class Class Description frequency Master Class Master Class Description
0 2812 DRIVING UNDER THE INFLUENCE 7.317386 2800 DRIVING UNDER THE INFLUENCE
1 1834 CDS-POSS MARIJUANA/HASHISH 5.708417 1800 CONTROLLED DANGEROUS SUBSTANCE POSSESSION
2 2938 POL INFORMATION 5.096495 2900 MISC
3 614 LARCENY FROM AUTO OVER $200 3.911164 600 LARCENY
4 617 LARCENY FROM BUILDING OVER $200 3.829860 600 LARCENY
5 2942 MENTAL TRANSPORT 3.598785 2900 MISC
6 1412 VANDALISM-MOTOR VEHICLE 3.260730 1400 VANDALISM
7 2941 LOST PROPERTY 3.119517 2900 MISC
8 619 LARCENY OTHER OVER $200 2.790021 600 LARCENY
9 634 LARCENY FROM AUTO UNDER $50 2.610296 600 LARCENY
10 1013 FORGERY/CNTRFT - IDENTITY THEFT 2.550387 1000 FORGERY/CNTRFT
11 613 LARCENY SHOPLIFTING OVER $200 2.263683 600 LARCENY
12 623 LARCENY SHOPLIFTING $50 - $199 2.092516 600 LARCENY
13 2216 LIQUOR - DRINK IN PUB OVER 21 2.062562 2200 LIQUOR - DRINK IN PUB OVER 21
14 2413 DISORDERLY CONDUCT 1.985536 2400 DISORDERLY CONDUCT
15 811 ASSAULT & BATTERY - CITIZEN 1.634644 800 ASSAULT & BATTERY
16 624 LARCENY FROM AUTO $50 - $199 1.523386 600 LARCENY
17 1011 FORGERY/CNTRFT-CRDT CARDS 1.463477 1000 FORGERY/CNTRFT
18 2943 MISSING PERSON 1.459198 2900 MISC
19 821 SIMPLE ASSAULT - CITIZEN 1.330823 800 ASSAULT & BATTERY
20 1864 CDS IMPLMNT-MARIJUANA/HASHISH 1.270914 1900 CONTROLLED DANGEROUS SUBSTANCE IMPLMNT
21 711 AUTO THEFT - PASSENGER VEHICLE 1.253798 700 THEFT
22 813 ASSAULT & BATTERY SPOUSE/PARTNER 1.240960 800 ASSAULT & BATTERY
23 635 LARCENY AUTO PART UNDER $50 1.142539 600 LARCENY
24 2111 JUVENILE RUNAWAY 1.104027 2100 JUVENILE RUNAWAY
25 633 LARCENY SHOPLIFTING UNDER $50 1.104027 600 LARCENY
26 512 BURG FORCE-RES/DAY 1.027002 500 BURGLARY
27 2737 TRESPASSING 0.954256 2700 TRESPASSING
28 2913 SUDDEN DEATH NATURAL 0.954256 2900 MISC
29 627 LARCENY FROM BUILDING $50-$199 0.945697 600 LARCENY
30 1411 VANDALISM-DWELLING 0.907185 1400 VANDALISM
31 639 LARCENY OTHER UNDER $50 0.894347 600 LARCENY
32 629 LARCENY OTHER $50 - $199 0.864393 600 LARCENY
33 1014 FORGERY/CNTRFT-ALL OTHER 0.842997 1000 FORGERY/CNTRFT
34 513 BURG FORCE-RES/TIME UNK 0.787368 500 BURGLARY
35 616 LARCENY BICYCLE OVER $200 0.761693 600 LARCENY
36 1417 VANDALISM-OTHER 0.736018 1400 VANDALISM
37 637 LARCENY FROM BLDG UNDER $50 0.706064 600 LARCENY
38 1824 CDS-SELL-MARIJUANA/HASHISH 0.671830 1800 CONTROLLED DANGEROUS SUBSTANCE POSSESSION
39 514 BURG FORCE-COMM/NIGHT 0.637597 500 BURGLARY
40 2946 RECOVERED PROPERTY/MONT. CO. 0.629038 2900 MISC
41 1012 FORGERY/CNTRFT-CHECKS 0.624759 1000 FORGERY/CNTRFT
42 511 BURG FORCE-RES/NIGHT 0.616201 500 BURGLARY

It's notorious that "Driving under the influency" is the most common crime commited but when we agregate classes we could take another understading of the most common crimes. As we can see below, the master class 600, the code to 'Larceny', is the most common type of crime with 25.44% of the TOP43 complaints.


In [78]:
test_top['Master Class'].value_counts(sort=True)


Out[78]:
600     14
2900     6
1000     4
500      4
1400     3
800      3
1800     2
700      1
2100     1
2800     1
1900     1
2200     1
2400     1
2700     1
Name: Master Class, dtype: int64

In [107]:
test_top.groupby(['Master Class']).sum()


Out[107]:
Class frequency
Master Class
500 2050 3.068167
600 8760 25.439685
700 711 1.253798
800 2445 4.206427
1000 4050 5.481621
1400 4240 4.903933
1800 3658 6.380247
1900 1864 1.270914
2100 2111 1.104027
2200 2216 2.062562
2400 2413 1.985536
2700 2737 0.954256
2800 2812 7.317386
2900 17623 14.857290

Could we categorize the types of crimes in violent or not?

According to wikipedia (https://en.wikipedia.org/wiki/Violent_crime)violent crimes include, but are not limited to, this list of crimes: Typically, violent criminals includes aircraft hijackers, bank robbers, muggers, burglars, terrorists, carjackers, rapists, kidnappers, torturers, active shooters, murderers, gangsters, drug cartels, and others.

Starting from the crimes in the pareto analysis we noticed that only tree master classes could be considered violent, that are:

  • 500 - BURGLARY
  • 700 - THEFT
  • 800 - ASSAULT & BATTERY

In [110]:
test_top['Violent crime'] = False

test_top.loc[(test_top['Master Class'] ==  500),'Violent crime'] = True
test_top.loc[(test_top['Master Class'] ==  800),'Violent crime'] = True
test_top.loc[(test_top['Master Class'] ==  700),'Violent crime'] = True

test_top.sort_values(['Violent crime', 'frequency'], ascending=False, axis=0, kind='quicksort')


Out[110]:
Class Class Description frequency Master Class Master Class Description Violent crime
15 811 ASSAULT & BATTERY - CITIZEN 1.634644 800 ASSAULT & BATTERY True
19 821 SIMPLE ASSAULT - CITIZEN 1.330823 800 ASSAULT & BATTERY True
21 711 AUTO THEFT - PASSENGER VEHICLE 1.253798 700 THEFT True
22 813 ASSAULT & BATTERY SPOUSE/PARTNER 1.240960 800 ASSAULT & BATTERY True
26 512 BURG FORCE-RES/DAY 1.027002 500 BURGLARY True
34 513 BURG FORCE-RES/TIME UNK 0.787368 500 BURGLARY True
39 514 BURG FORCE-COMM/NIGHT 0.637597 500 BURGLARY True
42 511 BURG FORCE-RES/NIGHT 0.616201 500 BURGLARY True
0 2812 DRIVING UNDER THE INFLUENCE 7.317386 2800 DRIVING UNDER THE INFLUENCE False
1 1834 CDS-POSS MARIJUANA/HASHISH 5.708417 1800 CONTROLLED DANGEROUS SUBSTANCE POSSESSION False
2 2938 POL INFORMATION 5.096495 2900 MISC False
3 614 LARCENY FROM AUTO OVER $200 3.911164 600 LARCENY False
4 617 LARCENY FROM BUILDING OVER $200 3.829860 600 LARCENY False
5 2942 MENTAL TRANSPORT 3.598785 2900 MISC False
6 1412 VANDALISM-MOTOR VEHICLE 3.260730 1400 VANDALISM False
7 2941 LOST PROPERTY 3.119517 2900 MISC False
8 619 LARCENY OTHER OVER $200 2.790021 600 LARCENY False
9 634 LARCENY FROM AUTO UNDER $50 2.610296 600 LARCENY False
10 1013 FORGERY/CNTRFT - IDENTITY THEFT 2.550387 1000 FORGERY/CNTRFT False
11 613 LARCENY SHOPLIFTING OVER $200 2.263683 600 LARCENY False
12 623 LARCENY SHOPLIFTING $50 - $199 2.092516 600 LARCENY False
13 2216 LIQUOR - DRINK IN PUB OVER 21 2.062562 2200 LIQUOR - DRINK IN PUB OVER 21 False
14 2413 DISORDERLY CONDUCT 1.985536 2400 DISORDERLY CONDUCT False
16 624 LARCENY FROM AUTO $50 - $199 1.523386 600 LARCENY False
17 1011 FORGERY/CNTRFT-CRDT CARDS 1.463477 1000 FORGERY/CNTRFT False
18 2943 MISSING PERSON 1.459198 2900 MISC False
20 1864 CDS IMPLMNT-MARIJUANA/HASHISH 1.270914 1900 CONTROLLED DANGEROUS SUBSTANCE IMPLMNT False
23 635 LARCENY AUTO PART UNDER $50 1.142539 600 LARCENY False
24 2111 JUVENILE RUNAWAY 1.104027 2100 JUVENILE RUNAWAY False
25 633 LARCENY SHOPLIFTING UNDER $50 1.104027 600 LARCENY False
27 2737 TRESPASSING 0.954256 2700 TRESPASSING False
28 2913 SUDDEN DEATH NATURAL 0.954256 2900 MISC False
29 627 LARCENY FROM BUILDING $50-$199 0.945697 600 LARCENY False
30 1411 VANDALISM-DWELLING 0.907185 1400 VANDALISM False
31 639 LARCENY OTHER UNDER $50 0.894347 600 LARCENY False
32 629 LARCENY OTHER $50 - $199 0.864393 600 LARCENY False
33 1014 FORGERY/CNTRFT-ALL OTHER 0.842997 1000 FORGERY/CNTRFT False
35 616 LARCENY BICYCLE OVER $200 0.761693 600 LARCENY False
36 1417 VANDALISM-OTHER 0.736018 1400 VANDALISM False
37 637 LARCENY FROM BLDG UNDER $50 0.706064 600 LARCENY False
38 1824 CDS-SELL-MARIJUANA/HASHISH 0.671830 1800 CONTROLLED DANGEROUS SUBSTANCE POSSESSION False
40 2946 RECOVERED PROPERTY/MONT. CO. 0.629038 2900 MISC False
41 1012 FORGERY/CNTRFT-CHECKS 0.624759 1000 FORGERY/CNTRFT False

From the classification made above over the most common crimes (80%) we could make a statement that only 8.53% of these kind of crimes are violent.


In [111]:
value_percentage = test_top[test_top['Violent crime'] == True]['frequency'].sum()
value_percentage = round(value_percentage,2)
print(str(value_percentage) + '% of the top43 crimes are violent')


8.53% of the top43 crimes are violent

In [130]:
valores = [100-value_percentage, value_percentage]
marcacoes = 'Não violentos', 'Violentos'

plt.pie(valores, labels=marcacoes, autopct='%1.2f%%', shadow=True)


Out[130]:
([<matplotlib.patches.Wedge at 0x117a64198>,
  <matplotlib.patches.Wedge at 0x117a6aa58>],
 [<matplotlib.text.Text at 0x117a64f28>,
  <matplotlib.text.Text at 0x117a73828>],
 [<matplotlib.text.Text at 0x117a6a4e0>,
  <matplotlib.text.Text at 0x117a73da0>])

Wich period (morning, afternoon, night) of the day that most complaints occur

Processing data from 'Dispatch Date/Time'


In [13]:
datetime = pd.to_datetime(monty_data['Dispatch Date / Time'])#Takes too long to process

In [112]:
#Considering the top crimes

date_data = monty_data[['Dispatch Date / Time', 'Class']]
#Creation of a Master Class
date_data['Master Class'] = 0
aux = date_data['Master Class'].astype(float,copy=True)
date_data['Master Class'] = aux
date_data['Master Class'] = date_data['Class']/100
date_data['Master Class'] = date_data['Master Class'].round()
date_data['Master Class'] = date_data['Master Class']*100
aux = date_data['Master Class'].astype(int,copy=True)
date_data['Master Class'] = aux
#date_data.head(5)
date_data['Period of the day']=''
#print("-------------")
#TODO make a better version of this cell, the cast not worked, work around


/Users/marco/anaconda/lib/python3.6/site-packages/ipykernel/__main__.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/Users/marco/anaconda/lib/python3.6/site-packages/ipykernel/__main__.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/Users/marco/anaconda/lib/python3.6/site-packages/ipykernel/__main__.py:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/Users/marco/anaconda/lib/python3.6/site-packages/ipykernel/__main__.py:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/Users/marco/anaconda/lib/python3.6/site-packages/ipykernel/__main__.py:10: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/Users/marco/anaconda/lib/python3.6/site-packages/ipykernel/__main__.py:12: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/Users/marco/anaconda/lib/python3.6/site-packages/ipykernel/__main__.py:14: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

For select period of times we simply filtered the column 'Dispatch Date / Time', transformed in a proper type (datetime) with

pd.to_datetime(monty_data['Dispatch Date / Time'])

extracted only the hour out of the hole datetime structure and filtered according the period of the day. So, Morning period received all complaint starting from 5AM to 12AM. Afternoon from 1AM to 6PM and Night from 7PM to 4AM as we can se after:


In [131]:
morning = datetime[(datetime.dt.hour > 5) & (datetime.dt.hour <= 12)]
afternoon = datetime[(datetime.dt.hour > 12) & (datetime.dt.hour <= 18)]
night = datetime[((datetime.dt.hour > 18) & (datetime.dt.hour <= 23)) | (datetime.dt.hour > 0) & (datetime.dt.hour <= 4)]

print("In the morning we computed "+ str(morning.shape[0]) + " crimes. There is a probability of " + str(round((morning.shape[0]/23369)*100,2)) + "% of a complaint be registered in the morning") 
print("At afternoon we computed "+ str(afternoon.shape[0]) + " crimes. There is a probability of " + str(round((afternoon.shape[0]/23369)*100,2)) + "% of a complaint be registered at afternoon") 
print("In the night we computed "+ str(night.shape[0]) + " crimes. There is a probability of " + str(round((night.shape[0]/23369)*100,2)) + "% of a complaint be registered in the morning")


In the morning we computed 8034 crimes. There is a probability of 34.38% of a complaint be registered in the morning
At afternoon we computed 6898 crimes. There is a probability of 29.52% of a complaint be registered at afternoon
In the night we computed 7311 crimes. There is a probability of 31.29% of a complaint be registered in the morning

Wich day of the week that most complaints occur

Based on the information of the column 'Dispatch Date/Time' filtered the dates according to it's weekdays. And bellow we have that TUESDAY has the majority of occurences. That information is followed by a bar chart showing the frequency of each day.


In [135]:
day_of_the_week = datetime.dt.weekday_name
result = day_of_the_week.value_counts()

print('Tuesday is the day more probable to happen a crime')
result.to_frame


Tuesday is the day more probable to happen a crime
Out[135]:
<bound method Series.to_frame of Tuesday      3836
Monday       3734
Wednesday    3611
Friday       3594
Thursday     3404
Saturday     2807
Sunday       2383
Name: Dispatch Date / Time, dtype: int64>

In [115]:
fig, ax = plt.subplots()

ind = np.arange(0, 7)
Tue = plt.bar(0, height=(result[0]/23369)*100, color='red')
Mon = plt.bar(1, height=(result[1]/23369)*100, color='green')
Wed = plt.bar(2, height=(result[2]/23369)*100, color='blue')
Fri = plt.bar(3, height=(result[3]/23369)*100, color='black')
Thu = plt.bar(4, height=(result[4]/23369)*100, color='brown')
Sat = plt.bar(5, height=(result[5]/23369)*100, color='yellow')
Sun = plt.bar(6, height=(result[6]/23369)*100, color='orange')

ax.set_xticks(ind)
ax.set_xticklabels(['Tuesday','Monday','Wednesday','Friday','Thursday','Saturday','Sunday'])

ax.set_ylim([0, 20])
plt.xticks(ind,rotation='45')
ax.set_ylabel('Percentage')
ax.set_title('Number of crimes per day of the week')

plt.show()


Analysing the distribution of violent crimes.


Looking for a way to better see how the 'Dispatch Date/Time' is distributed according the period of the day and day of the week we found a great example of a scatter chart (https://bokeh.pydata.org/en/latest/docs/user_guide/categorical.html#adding-jitter). Unfortunatelly there is a problem importing jitter to this notebook!


In [147]:
datetime_analys = pd.to_datetime(monty_data[(monty_data['MasterClass']==500) | (monty_data['MasterClass']==600) | (monty_data['MasterClass']==800)]['Dispatch Date / Time'])#Takes too long to process

In [148]:
#output_file("categorical_scatter_jitter.html")

DAYS = ['Sun', 'Sat', 'Fri', 'Thu', 'Wed', 'Tue', 'Mon']
dow = {0 : 'Sun', 1:'Mon', 2:'Tue', 3:'Wed', 4:'Thu', 5:'Fri', 6:'Sat'}
#copy of original datetime
scatter_datetime = datetime_analys
scatter_datetime = scatter_datetime.rename('complete_date')
scatter_dayofweek = datetime_analys.dt.dayofweek

scatter_dayofweek = scatter_dayofweek.rename('day_of_week')
scatter_dayofweek=scatter_dayofweek.replace(dow)

scatter_hour = datetime_analys.dt.time
scatter_hour = scatter_hour.rename('hour')

dictionary = {'datetime': scatter_datetime, 'day': scatter_dayofweek, 'time':scatter_hour}
scatter_complete = pd.DataFrame(data=dictionary)

source = ColumnDataSource(scatter_complete)

p = figure(plot_width=950, plot_height=500, y_range=DAYS, x_axis_type='datetime',
           title="Violent complaints by Time of Day (US/Central) - Montgomey County - 2013")

#p.circle(x='time', y=Jitter('day', width=0.6, range=p.y_range),  source=source, alpha=0.3)
p.circle(x='time', y='day',  source=source, alpha=0.1)
#p.circle(x='time', y=Jitter('day', width=0.6, mean=0, distribution='uniform', range=p.y_range,),  source=source, alpha=0.3)
#p.circle(x='time', y=Jitter(name='day', width=0.6, mean=0, distribution='uniform'),  source=source, alpha=0.3)
#p.circle(x='time', y='day',  source=source, alpha=0.3, size=1)
#field_name, width, mean=0, distribution='uniform', range=None
#p.circle(x='time', y=field('day', Jitter(mean=0, width=0.6, distribution='uniform', range=None)),  source=source, alpha=0.3)
p.xaxis[0].formatter.days = ['%Hh']
p.x_range.range_padding = 0
p.ygrid.grid_line_color = None

In [149]:
show(p,notebook_handle=True)


Out[149]:

<Bokeh Notebook handle for In[275]>

As we can see above there is a concentration starting from aroud 7:00AM in the majority of the days. On Friday and Saturday the are fewer complaints registered.

Making the same kind of plot with the data of the 'Start Time' the is visible the difference between the two plots. I this second approach there is a shift in the time of the day as we can see below.


In [150]:
datetime_analys_start = pd.to_datetime(monty_data[(monty_data['MasterClass']==500) | (monty_data['MasterClass']==600) | (monty_data['MasterClass']==800)]['Start Date / Time'])#Takes too long to process

In [151]:
scatter_datetime_start = datetime_analys_start
scatter_datetime_start = scatter_datetime_start.rename('complete_date')
scatter_dayofweek_start = datetime_analys_start.dt.dayofweek

scatter_dayofweek_start = scatter_dayofweek_start.rename('day_of_week')
scatter_dayofweek_start = scatter_dayofweek_start.replace(dow)

scatter_hour_start = datetime_analys_start.dt.time
scatter_hour_start = scatter_hour_start.rename('hour')

dictionary_start = {'datetime': scatter_datetime_start, 'day': scatter_dayofweek_start, 'time':scatter_hour_start}
scatter_complete_start = pd.DataFrame(data=dictionary_start)

source_start = ColumnDataSource(scatter_complete_start)

p = figure(plot_width=950, plot_height=500, y_range=DAYS, x_axis_type='datetime',
           title="Violent complaints by Time of Day (US/Central) - Start Time - Montgomey County - 2013")

p.circle(x='time', y='day',  source=source_start, alpha=0.1)
p.xaxis[0].formatter.days = ['%Hh']
p.x_range.range_padding = 0
p.ygrid.grid_line_color = None

In [152]:
show(p,notebook_handle=True)


Out[152]:

<Bokeh Notebook handle for In[275]>

Wich month of the years that most complaints occur

Now is time to see how are the behaviour according to the month of the Dispatch Time. First of all, we can detect there is a lack of information from the first semester of the year, chart below. So, considering only the second half of the year the month with more complaints is OCTOBER with almost 17.5%


In [153]:
month_of_the_year = datetime.dt.month

result_month = month_of_the_year.value_counts()

print('October is the month more likely to happen a crime with a probability of '+ str((result_month[10]/23369)*100)+'%')
number_of_crimes = sum(result_month)
print('Total number of complaints: ' + str(number_of_crimes))
result_month.to_frame

fig1, ax1 = plt.subplots()

ind = np.arange(0, 13)
Oct = plt.bar(10,height=(result_month[10]/23369)*100,color='indigo')
Aug = plt.bar(8,height=(result_month[8]/23369)*100,color='cyan')
Nov = plt.bar(11,height=(result_month[11]/23369)*100,color='blue')
Sep = plt.bar(9,height=(result_month[9]/23369)*100,color='lightblue')
Dec = plt.bar(12,height=(result_month[12]/23369)*100,color='purple')
Jul = plt.bar(7,height=(result_month[7]/23369)*100,color='magenta')

ax.set_xticks(ind)
ax.set_xticklabels(['January','February','March','April','May','June','July','August','September','October','November','December'])
ax.set_ylim([0, 20])
plt.xticks(ind,rotation='45')
#plt.xlabel.
ax.set_ylabel('Percentage')
ax.set_title('Number of crimes per month of the year')

plt.show()


October is the month more likely to happen a crime with a probability of 17.4376310497%
Total number of complaints: 23369

Gattering all together we have that in 2013, if you are in the month of September in the period of the morning of a Tuesday is better you be cautious because that is the combination more probable to you commit a crime or be victim of a crime #GoodLuck :)

Location of crimes

The dataset contains the latitude and longitude of the complaints. This data could be utilized to see in a map how the crimes are distributed.

Configuring maps and loading data about where the complaints have occured. Observe, to sucesfully configure the Google Maps you have to create an API Key (You can generate one from this site: https://developers.google.com/maps/documentation/javascript/get-api-key) and change in the line 'plot.api_key = ""'


In [154]:
map_options = GMapOptions(lat=39.151040, lng=-77.193020, map_type="roadmap", zoom=11)

plot = GMapPlot(x_range=DataRange1d(), y_range=DataRange1d(), map_options=map_options)
plot.title.text = "Montgomery County"

# For GMaps to function, Google requires you obtain and enable an API key:
#
#     https://developers.google.com/maps/documentation/javascript/get-api-key
#
# Replace the value below with your personal API key:
plot.api_key = "AIzaSyBFHmpkUOfk2FtDZXHVBSUUHp6LVPmI-fs"

Load data in using read_csv function and filtering Latitude and Longitude data. After this showing a preview of the data loaded.


In [29]:
#Reference https://www.census.gov/data/tables/2016/demo/popest/total-cities-and-towns.html 
#https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?fpt=table
pop_data = pd.read_csv("PEP_2016_PEPANNRES.csv",sep=';')

#Adjusting data
#Removing city configuration
pop_data['GEO.display-label'] = pop_data['GEO.display-label'].str.replace("city, Maryland","")
pop_data['GEO.display-label'] = pop_data['GEO.display-label'].str.replace("town, Maryland","")
pop_data['GEO.display-label'] = pop_data['GEO.display-label'].str.replace("village, Maryland","")

pop_data=pop_data.rename_axis({'GEO.display-label': "City"}, axis="columns")
pop_data['City'] = pop_data['City'].str.upper()
pop_data['City'] = pop_data['City'].str.strip() #Merge not working 'cause of a ' '

#Now the data was merged with data from population
pop_data_montydata = monty_data.merge(pop_data, left_on=['City'], right_on=['City'], how='inner')
pop_data[['City','respop72013']].sort_values(by=['respop72013'],ascending=False)


Out[29]:
City respop72013
7 GAITHERSBURG 65761
15 ROCKVILLE 63736
17 TAKOMA PARK 17503
14 POOLESVILLE 5068
2 CHEVY CHASE 2930
10 KENSINGTON 2331
6 CHEVY CHASE VILLAGE 2026
16 SOMERSET 1249
8 GARRETT PARK 1026
12 MARTIN'S ADDITIONS 980
5 CHEVY CHASE VIEW 967
4 CHEVY CHASE SECTION THREE 777
3 CHEVY CHASE SECTION FIVE 698
13 NORTH CHEVY CHASE 575
18 WASHINGTON GROVE 551
11 LAYTONSVILLE 367
9 GLEN ECHO 265
0 BARNESVILLE 178
1 BROOKEVILLE 132

Which city/town/village/small village and etc are responsible


In [30]:
monty_data.columns
#monty_data[monty_data['City']].value_counts
cities_data = pd.DataFrame({'freq':monty_data['City'].value_counts(normalize=True,sort=True,ascending=False)*100,'count':monty_data['City'].value_counts(sort=True,ascending=False)})
cities_data['per_hundred_thousand'] = cities_data['count']/100000 
#cities_data.merge(pop_data, left_on=['object'], right_on=['City'], how='inner')
cities_data


Out[30]:
count freq per_hundred_thousand
SILVER SPRING 8626 36.912149 0.08626
ROCKVILLE 3453 14.775985 0.03453
GAITHERSBURG 3403 14.562027 0.03403
GERMANTOWN 2170 9.285806 0.02170
BETHESDA 1736 7.428645 0.01736
MONTGOMERY VILLAGE 687 2.939792 0.00687
POTOMAC 527 2.255124 0.00527
CHEVY CHASE 498 2.131028 0.00498
OLNEY 380 1.626086 0.00380
KENSINGTON 363 1.553340 0.00363
BURTONSVILLE 304 1.300869 0.00304
DERWOOD 270 1.155377 0.00270
DAMASCUS 230 0.984210 0.00230
CLARKSBURG 173 0.740297 0.00173
TAKOMA PARK 141 0.603363 0.00141
POOLESVILLE 105 0.449313 0.00105
BOYDS 90 0.385126 0.00090
BROOKEVILLE 70 0.299542 0.00070
SANDY SPRING 43 0.184004 0.00043
DICKERSON 26 0.111259 0.00026
ASHTON 19 0.081304 0.00019
CABIN JOHN 18 0.077025 0.00018
SPENCERVILLE 9 0.038513 0.00009
WASHINGTON GROVE 6 0.025675 0.00006
BRINKLOW 5 0.021396 0.00005
GLEN ECHO 4 0.017117 0.00004
BARNESVILLE 4 0.017117 0.00004
MOUNT AIRY 3 0.012838 0.00003
BEALLSVILLE 2 0.008558 0.00002
LAUREL 2 0.008558 0.00002
HYATTSVILLE 1 0.004279 0.00001
KISSIMMEE 1 0.004279 0.00001

Interesting fact is that SILVER SPRING does not appear in the census dataset collected. SILVER SPRING is classified as 'Unincorporated Comunity' and we do not have this kind of classification here. All municipalities are 'Incorporated comunitie' in Brazil.

Ref:https://en.wikipedia.org/wiki/Silver_Spring,_Maryland

Ref:https://en.wikipedia.org/wiki/Unincorporated_community

Bellow there is a distribuition of the amount of crimes per municipalities


In [165]:
fig, axes = plt.subplots(nrows=2, ncols=1,figsize=(40,40))

cities_data['count'].plot(kind='bar',ax=axes[0])
axes[0].set_title('Count of crimes')


Out[165]:
<matplotlib.text.Text at 0x1211408d0>

In [32]:
monty_data['Class'].unique
freq_data = monty_data['Class'].value_counts(normalize=True, sort=True, ascending=False)
freq_data.to_frame
chart_data = monty_data['Class'].value_counts(normalize=True,sort=True,ascending=False)#.to_dict(orient='series')
print(chart_data.head(10))


2812    0.073174
1834    0.057084
2938    0.050965
614     0.039112
617     0.038299
2942    0.035988
1412    0.032607
2941    0.031195
619     0.027900
634     0.026103
Name: Class, dtype: float64

Configuring Bokeh to utilize Google Maps.


In [166]:
map_options2 = GMapOptions(lat=39.151042, lng=-77.193023, map_type="roadmap", zoom=11)

plot2 = GMapPlot(x_range=DataRange1d(), y_range=DataRange1d(), map_options=map_options2)
plot2.title.text = "Montgomery County"

# For GMaps to function, Google requires you obtain and enable an API key:
#
#     https://developers.google.com/maps/documentation/javascript/get-api-key
#
# Replace the value below with your personal API key:
plot2.api_key = "AIzaSyBFHmpkUOfk2FtDZXHVBSUUHp6LVPmI-fs"

In [169]:
violent_data = monty_data[(monty_data['MasterClass']==500) | (monty_data['MasterClass']==700) | (monty_data['MasterClass']==800)]
non_violent_data = monty_data[~((monty_data['MasterClass']==500) | (monty_data['MasterClass']==700) | (monty_data['MasterClass']==800))]

source_violent = ColumnDataSource(
    data=dict(
        lat=violent_data["Latitude"],
        lon=violent_data["Longitude"],
    )
)
source_non_violent = ColumnDataSource(
    data=dict(
        lat=non_violent_data["Latitude"],
        lon=non_violent_data["Longitude"],
    )
)

###print(source.data.values)
circle_red = Circle(x="lon", y="lat", size=3, fill_color="red", fill_alpha=0.8, line_color=None)
circle_blue = Circle(x="lon", y="lat", size=3, fill_color="green", fill_alpha=0.8, line_color=None)
plot2.add_glyph(source_violent, circle_red)
plot2.add_glyph(source_non_violent, circle_blue)
#
plot2.add_tools(PanTool(), WheelZoomTool(), BoxSelectTool())

In [170]:
show(plot2,notebook_handle=True)


Out[170]:

<Bokeh Notebook handle for In[275]>


In [155]:
source = ColumnDataSource(
    data=dict(
        lat=latitude_data[13:130],
        lon=longitude_data[13:130],
    )
)

print(source.data.values)
circle = Circle(x="lon", y="lat", size=1, fill_color="blue", fill_alpha=0.8, line_color=None)
plot.add_glyph(source, circle)

plot.add_tools(PanTool(), WheelZoomTool(), BoxSelectTool())


<built-in method values of PropertyValueColumnData object at 0x117c343b8>

Ploting the geographic data in Google Maps. Note that the 'show' function receives another parameter 'notebook_handle=True' responsible for tell Bhoke to do a inline plot


In [58]:
show(plot,notebook_handle=True)


Out[58]:

<Bokeh Notebook handle for In[58]>

Is it possible to make a correlation between the Unemployment rate and the crimes commited?

First of all wme need a good source of information about Unemployment rate. We could find onde dataset about this subject in the


In [43]:
#unemployment = pd.read_csv("Unemployment_MontgomeryCounty_2005to2017.csv",sep=';')
#unemployment